(54) Method for alpha blending images utilizing a visual instruction set



(57) An image alpha blending method utilizing a parallel processor is provided. The computer-implemented
method includes the steps of loading unaligned multiple word components into a processor in one machine
instruction, each word component associated with a pixel of an image; alpha blending the multiple word
components of different source images and a control image in parallel; and storing the alpha blended
multiple word components of a destination image into memory in parallel.

Description

COPYRIGHT NOTICE 

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The
copyright owner has no objection to the xerographic reproduction by anyone of the patent document or the patent
disclosure in exactly the form it appears in the Patent and Trademark Office patent file or records, but otherwise reserves
all copyright rights whatsoever.

RELATED APPLICATIONS

The present invention is related to U.S. Patent Application No. 08/236,572 by Van Hook et al., filed April 29, 1994,
entitled "A Central Processing Unit with Integrated Graphics Functions," as well as U.S. Patent Application No.
08/ , (Atty Dckt No. P-1867) by Chang-Guo Zhou et al., filed March 3, 1995, entitled "Color Format Conversion
in a Parallel Processor," both of which are incorporated in their entirety herein by reference for all purposes.

APPENDIX 

The appendix is a copy of the "Visual Instruction Set User's Guide."

BACKGROUND OF THE INVENTION 

1. Field of the Invention.

The present invention relates generally to image processing and more particularly to blending two images to form
a destination image.

2. Description of the Relevant Art. 

One of the first uses of computers was the repetitious calculation of mathematical equations. Even the earliest
computers surpassed their creators in their ability to process data accurately and quickly. It is this processing power
that makes computers very well suited for tasks such as digital image processing.

A digital image is an image where the pixels are expressed in digital values. These images may be generated from
any number of sources including scanners, medical equipment, graphics programs, and the like. Additionally, a digital
image may be generated from an analog image. Typically, a digital image is composed of rows and columns of pixels.
In the simplest gray-scale images, each pixel is represented by a luminance (or intensity) value. For example, each
pixel may be represented by a single unsigned byte with a range of 0-255, where 0 specifies the darkest pixel, 255
specifies the brightest pixel and the other values specify an intermediate luminance.

However, images may also be more complex, with each pixel taking one of an almost unlimited number of chrominances
(or colors) and luminances. For example, each pixel may be represented by four bands corresponding to R, G, B, and α.
As is readily apparent, the increase in the number of bands has a proportional impact on the number of operations
necessary to manipulate each pixel, and therefore the image.

Blending two images to form a resulting image is a function provided by many image processing libraries, for
example the XIL imaging library developed by the SunSoft division of Sun Microsystems, Inc. and included in the Solaris
operating system.

An example of image blending will now be described with reference to Figs. 8A-D. In the simplest example, two
source images (src1 and src2) are blended to form a destination image (d). The blending is controlled by a control
image (α), the function of which is described below. All images are 1000 x 600 pixels and src1, src2, and d are
single-band gray-scale images.

Referring to Figs. 8A-D, the src2, src1, destination, and control images are respectively depicted, where src1 is a
car on a road, src2 is a mountain scene, and d is the car superimposed on the mountain scene. Each pixel in the d
image is computed from corresponding pixels in the src1, src2, and control images according to the following formula:

dst = α*src1 + (1-α)*src2        Eq. 1

where α is either 0, 1, or a fraction. The α values are derived from pixels in the control image which correspond to
pixels in src1 and src2. Thus, the calculations of Eq. 1 must be performed for each pixel in the destination image.

Thus, referring to Figs. 8A-D, the value of all pixels in the control image corresponding to the pixels in src1
representing the car is "1" and the value of all pixels in the control image outside the car is "0". Thus, according to Eq.
1, the pixels in the destination image corresponding to the "1" pixels in the control image would represent
the car and the pixels in the destination image corresponding to the "0" pixels in the control image would represent the
mountain scene. In practice, the pixel values near the edge of the car would have fractional values to make the edge
formed by the car and the mountain scene appear realistic.
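Eq. 1 can be sketched as a plain per-pixel loop. The following is only an illustrative scalar reference under stated assumptions, not the routine of the invention: the name blend_scalar is hypothetical, and α is taken as a float in the range 0 to 1 for a single-band image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for Eq. 1: dst = alpha*src1 + (1 - alpha)*src2.
 * blend_scalar is a hypothetical name; alpha is assumed to be in [0, 1]. */
void blend_scalar(const uint8_t *src1, const uint8_t *src2,
                  const float *alpha, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(alpha[i] >= 0.0f && alpha[i] <= 1.0f);
        float v = alpha[i] * (float)src1[i]
                + (1.0f - alpha[i]) * (float)src2[i];
        dst[i] = (uint8_t)(v + 0.5f);   /* round to nearest */
    }
}
```

With α equal to 1 the destination pixel is the src1 pixel, with α equal to 0 it is the src2 pixel, and fractional values mix the two.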

While the alpha blending function is provided by existing image libraries, typically the function is executed on a 
processor having integer and floating point units and utilizes generalized instructions for performing operations utilizing 
those processors. 

However, certain problems associated with alpha blending operations can cause the blending to be slow and
inefficient when performed utilizing generalized instructions. In particular, most processors have a memory interface
designed to access words aligned along word boundaries. For example, if the word is 8 bytes (64 bits), then words are
transferred between memory and the processor beginning at address 0, so that all word addresses must be divisible by 8.
However, image data tends to be misaligned, i.e., does not begin or end on aligned addresses, due to many
factors including multiple bands. Further, words containing multiple bytes are usually transferred between memory and
the processor, and standard methods do not take advantage of the inherent parallelism due to the presence of multiple
pixels in the registers.

Known image blending techniques basically loop through the image and process each pixel in sequence. This
is a very simple process, but for a moderately complex 3000x4000 pixel image the computer may have to perform 192
million instructions or more. This estimate assumes an image of 3000x4000 pixels, each pixel being represented by
four bands, and four instructions to process each value or band. This calculation shows that what appears to be a simple
process quickly becomes very computationally expensive and time consuming.

As the resolution and size of images increases, improved systems and methods are needed that increase the 
speed with which computers may blend images. The present invention fulfills this and other needs. 

SUMMARY OF THE INVENTION 

The present invention provides innovative systems and methods of blending digital images. The present invention 
utilizes two levels of concurrency to increase the efficiency of image alpha blending. At a first level, machine instructions 
that are able to process multiple data values in parallel are utilized. At another level, the machine instructions are 
performed within the microprocessor concurrently. The present invention provides substantial performance increases
in image alpha blending technology.

According to one aspect of the invention, an image alpha blending method of the present invention operates in a computer
system and includes the steps of loading multiple word components of an unaligned image into a microprocessor in
parallel, each word component associated with a pixel of an image; alpha blending the multiple word components in
parallel; and storing the unaligned multiple word components of a destination image into memory in parallel.

According to another aspect of the invention, a computer program product includes a computer usable medium
having computer readable code embodied therein for causing the loading of multiple word components of an unaligned
image into a microprocessor in parallel, each word component associated with a pixel of an image; alpha blending the
multiple word components in parallel; and storing the unaligned multiple word components of a destination image into
memory in parallel.

Other features and advantages of the present invention will become apparent upon a perusal of the remaining 
portions of the specification and drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 illustrates an example of a computer system used to execute the software of the present invention;

Fig. 2 shows a system block diagram of a typical computer system used to execute the software of the present
invention;

Fig. 3 is a block diagram of the major functional units in the UltraSPARC-I microprocessor;

Fig. 4 shows a block diagram of the Floating Point/Graphics Unit;

Fig. 5 is a flow diagram of a partitioned multiply instruction;

Fig. 6 is a flow diagram of a partitioned add instruction;

Figs. 7A-C are flow diagrams of a partitioned pack instruction;

Figs. 8A-D are depictions of source, destination, and control images;

Fig. 9 is a flow chart depicting a preferred embodiment of a method of alpha blending two images; and

Fig. 10 is a depiction of the modification of word components effected by the steps in a routine for calculating the
destination image.

DESCRIPTION OF THE PREFERRED EMBODIMENT

The following are definitions of some of the terms used herein. 

Pixel (picture element) - a small section or spot in an image, whether the image is on a computer screen, paper, film,
memory, or the like.

Byte - a unit of information having 8 bits. 

Word - a unit of information that is typically a 16, 32 or 64-bit quantity. 

Machine instructions (or code) - binary sequences that are loaded and executed by a microprocessor. 

In the description that follows, the present invention will be described with reference to a Sun workstation
incorporating an UltraSPARC-I microprocessor and running under the Solaris operating system. The UltraSPARC-I is a
highly integrated superscalar 64-bit processor and includes the ability to perform multiple partitioned integer arithmetic
operations concurrently. The UltraSPARC-I microprocessor will be described below but is also described in U.S.
Application No. 08/236,572 by Van Hook et al., filed April 29, 1994, entitled "A Central Processing Unit with Integrated
Graphics Functions," which is hereby incorporated by reference for all purposes. The present invention, however, is
not limited to any particular computer architecture or operating system. Therefore, the description of the embodiments
that follow is for purposes of illustration and not limitation.

Fig. 1 illustrates an example of a computer system used to execute the software of the present invention. Fig. 1
shows a computer system 1 which includes a monitor 3, screen 5, cabinet 7, keyboard 9, and mouse 11. Mouse 11
may have one or more buttons such as mouse buttons 13. Cabinet 7 houses a CD-ROM drive 15 or a hard drive (not
shown) which may be utilized to store and retrieve software programs incorporating the present invention, digital images
for use with the present invention, and the like. Although a CD-ROM 17 is shown as the removable media, other
removable tangible media including floppy disks, tape, and flash memory may be utilized. Cabinet 7 also houses familiar
computer components (not shown) such as a processor, memory, and the like.

Fig. 2 shows a system block diagram of computer system 1 used to execute the software of the present invention.
As in Fig. 1, computer system 1 includes monitor 3 and keyboard 9. Computer system 1 further includes subsystems
such as a central processor 102, system memory 104, I/O controller 106, display adapter 108, removable disk 112,
fixed disk 116, network interface 118, and speaker 120. Other computer systems suitable for use with the present
invention may include additional or fewer subsystems. For example, another computer system could include more than
one processor 102 (i.e., a multi-processor system) or a cache memory.

Arrows such as 122 represent the system bus architecture of computer system 1. However, these arrows are
illustrative of any interconnection scheme serving to link the subsystems. For example, a local bus could be utilized to
connect the central processor to the system memory and display adapter. Computer system 1 shown in Fig. 2 is but
an example of a computer system suitable for use with the present invention. Other configurations of subsystems
suitable for use with the present invention will be readily apparent to one of ordinary skill in the art.

Fig. 3 is a block diagram of the major functional units in the UltraSPARC-I microprocessor. A microprocessor 140
includes a front end Prefetch and Dispatch Unit (PDU) 142. The PDU prefetches instructions based upon a dynamic
branch prediction mechanism and a next field address which allows single cycle branch following. Typically, branch
prediction is better than 90% accurate, which allows the PDU to supply four instructions per cycle to a core execution
block 144.

The core execution block includes a Branch Unit 145, an Integer Execution Unit (IEU) 146, a Load/Store Unit (LSU)
148, and a Floating Point/Graphics Unit (FGU) 150. The units that make up the core execution block may operate in
parallel (up to four instructions per cycle), which substantially enhances the throughput of the microprocessor. The IEU
performs the integer arithmetic or logical operations. The LSU executes the instructions that transfer data between the
memory hierarchy and register files in the IEU and FGU. The FGU performs floating point and graphics operations.

Fig. 4 shows a block diagram of the Floating Point/Graphics Unit. FGU 150 includes a Register File 152 and five
functional units which may operate in parallel. The Register File incorporates 32 64-bit registers. Three of the functional
units are a floating point divider 154, a floating point multiplier 156, and a floating point adder 158. The floating point
units perform all the floating point operations. The remaining two functional units are a graphics multiplier (GRM) 160
and a graphics adder (GRA) 162. The graphical units perform all the graphics operations of the Visual Instruction Set
(VIS) instructions.

The VIS instructions are machine code extensions that allow for enhanced graphics capabilities. The VIS
instructions typically operate on partitioned data formats. In a partitioned data format, 32 and 64-bit words include multiple
word components. For example, a 32-bit word may be composed of four unsigned bytes and each byte may represent
a pixel intensity value of an image. As another example, a 64-bit word may be composed of four signed 16-bit words
and each 16-bit word may represent the result of a partitioned multiplication.
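The partitioned format described above can be illustrated with a small sketch that packs four pixel bytes into one 32-bit word and extracts them again. The helper names pack4 and component are hypothetical, and placing the first pixel in the most significant byte is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four unsigned bytes into one 32-bit word, p[0] in the most
 * significant byte. The byte order is an assumption for illustration. */
uint32_t pack4(const uint8_t p[4])
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Extract component i (0 = most significant byte) from a packed word. */
uint8_t component(uint32_t word, int i)
{
    assert(i >= 0 && i < 4);
    return (uint8_t)(word >> (8 * (3 - i)));
}
```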

The VIS instructions allow the microprocessor to operate on multiple pixels or bands in parallel. The GRA performs
single cycle partitioned add and subtract, data alignment, merge, expand and logical operations. The GRM performs
three cycle partitioned multiplication, compare, pack and pixel distance operations. The following is a description of
some of these operations that may be utilized with the present invention.

Fig. 5 is a flow diagram of a partitioned multiply operation. Each unsigned 8-bit component (i.e., a pixel) 202A-D
held in a first register 202 is multiplied by a corresponding (signed) 16-bit fixed point integer component 204A-D held
in a second register 204 to generate a 24-bit product. The upper 16 bits of the resulting product are stored as
corresponding 16-bit result components 205A-D in a result register 205.
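The partitioned multiply can be emulated in portable C as a sketch. The helper names below are hypothetical, and the code follows only what the text states (the upper 16 bits of the 24-bit product are kept); any rounding performed by actual hardware is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division by 256, written without shifting negative values. */
static int32_t floor_div256(int32_t v)
{
    return v >= 0 ? v / 256 : -((-v + 255) / 256);
}

/* Emulates the partitioned multiply described in the text: each unsigned
 * 8-bit component is multiplied by a signed 16-bit component and the upper
 * 16 bits of the 24-bit product are kept. Hypothetical helper; hardware
 * rounding, if any, is not modeled. */
void mul8x16_emul(const uint8_t a[4], const int16_t b[4], int16_t out[4])
{
    assert(a && b && out);
    for (int i = 0; i < 4; i++) {
        int32_t product = (int32_t)a[i] * (int32_t)b[i]; /* fits in 24 bits */
        out[i] = (int16_t)floor_div256(product);
    }
}
```

Dropping the low 8 bits of the product is what later allows the blend routine to divide by 256 without a separate instruction.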

Fig. 6 is a flow diagram of a partitioned add/subtract operation. Each 16-bit signed component 202A-D held in the
first register 202 is added to (or subtracted from) a corresponding 16-bit signed component 204A-D held in the second
register 204 to form a resulting 16-bit sum/difference component which is stored as a corresponding result component
205A-D in a result register 205.
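A minimal emulation of the partitioned add and subtract is sketched below. The function names are hypothetical, and the wrap-around behavior on overflow (no saturation) is an assumption of this sketch rather than something stated in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Partitioned add: four independent 16-bit additions. Results wrap modulo
 * 2^16 in this sketch (an assumption; no saturation is applied). */
void padd16_emul(const int16_t a[4], const int16_t b[4], int16_t out[4])
{
    assert(a && b && out);
    for (int i = 0; i < 4; i++)
        out[i] = (int16_t)(uint16_t)((uint16_t)a[i] + (uint16_t)b[i]);
}

/* Partitioned subtract with the same conventions: out = a - b. */
void psub16_emul(const int16_t a[4], const int16_t b[4], int16_t out[4])
{
    assert(a && b && out);
    for (int i = 0; i < 4; i++)
        out[i] = (int16_t)(uint16_t)((uint16_t)a[i] - (uint16_t)b[i]);
}
```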

Fig. 7A is a flow diagram of a partitioned pack operation. Each 16-bit fixed value component 202A-D held in a first
register 202 is scaled, truncated and clipped into an 8-bit unsigned integer component which is stored as a
corresponding result component 205A-D in a result register 205. This operation is depicted in greater detail in Figs. 7B-C.

Referring to Fig. 7B, a 16-bit fixed value component 202A is left shifted by the number of bits specified by a GSR scale
factor held in GSR register 400 (in this example the GSR scale factor is 10) while maintaining clipping information. Next,
the shifted component is truncated and clipped to an 8-bit unsigned integer starting at the bit immediately to the left of
the implicit binary point (i.e., between bits 7 and 6 for each 16-bit word). Truncation is performed to convert the scaled
value into a signed integer (i.e., round to negative infinity). Fig. 7C depicts an example with the GSR scale factor equal
to 8.
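The scale, truncate, and clip sequence can be sketched for a single component as follows. The name pack16_emul is hypothetical; the sketch assumes, as the text indicates, that the binary point lies between bits 7 and 6 of the shifted value, so 7 fraction bits are discarded.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates the pack step described in the text for one 16-bit component:
 * shift left by the GSR scale factor, drop the 7 fraction bits below the
 * implicit binary point (rounding toward negative infinity), and clip to
 * 0..255. pack16_emul is a hypothetical helper, not a VIS intrinsic. */
uint8_t pack16_emul(int16_t x, int scale)
{
    assert(scale >= 0 && scale <= 15);
    /* Multiply instead of shifting so negative inputs stay well defined. */
    int32_t shifted = (int32_t)x * ((int32_t)1 << scale);
    int32_t q = shifted >= 0 ? shifted / 128 : -((-shifted + 127) / 128);
    if (q < 0)   return 0;
    if (q > 255) return 255;
    return (uint8_t)q;
}
```

For a pixel byte that was expanded by a 4-bit left shift, a scale factor of 3 in this model returns the original byte, since the combined left shift of 7 cancels the 7 discarded fraction bits.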

ALPHA BLENDING

Fig. 9 is a flow chart depicting a preferred embodiment of a method of alpha blending two images. In Fig. 9,
multi-component words, with each component associated with a pixel value, are loaded from unaligned areas of memory
holding the src1, src2, and control images. The components of these multi-component words are processed in parallel
to generate components of a multi-component word holding pixel values of the destination image. The components of
a destination word are stored in parallel to an unaligned area of memory.

Accordingly, except for doing different arithmetic as dictated by specified precision values and definitions of
partitioned operations, each of the routines described below follows the same pattern for each line of pixels: within the
loop, load data, align data, and perform arithmetic in parallel; before and after the loop, deal with edges.

Loading Misaligned Image Data 

The use of the visual instruction set to load the src1, src2, and alpha images into the registers of the GPU will now
be described. For purpose of illustration it is assumed that the src1 image data begins at Address 16005, src2 begins
at Address 24003, and dst begins at 8001. Accordingly, neither src1, src2, nor dst begins on an aligned byte address.
In this example, all images are assumed to comprise 8-bit pixels and have only a single band.

For purposes of explanation the VIS assembly instructions are used in function call notation. This implies that
memory locations instead of registers are referenced, hence aligned loads will be implied rather than explicitly stated.
This notation is routinely used by those skilled in the art and is not ambiguous.

The special visual instructions utilized to load the misaligned data are alignaddr(addr, offset) and
faligndata(data_hi, data_lo). The function and syntax of these instructions are fully described in Appendix B. The use of
these instructions in an exemplary subroutine for loading misaligned data will be described below.

The function of the alignaddr() instruction is to return an aligned address equal to the nearest aligned address
occurring before the misaligned address and to write the offset of the misaligned address from the returned
address to the GSR. The integer in the offset argument is added to the offset of the misaligned address prior to writing
the offset to the GSR.

For example, as stated above, if the starting Address for src1 is 16005, then alignaddr(16005, 0) returns the aligned
address of 16000 and writes 5 to the GSR.

The function of faligndata() is to load an 8-byte double word beginning at a misaligned address. This is
accomplished by loading two consecutive double words having aligned addresses equal to data_hi and data_lo and using
the offset written to the GSR to fetch 8 consecutive bytes, with the first byte offset from the aligned boundary by the
number written to the GSR.

If s1 is the address of the first byte of the first 64-bit word of the misaligned src1 data, then the routine:

s1_aligned = alignaddr(s1, 0)
u_s1_0 = s1_aligned[0]
u_s1_1 = s1_aligned[1]
dbl_s1 = faligndata(u_s1_0, u_s1_1)

sets s1_aligned to the aligned address preceding the beginning of the first 64-bit word, i.e., 16000, sets u_s1_0 equal
to the word at the first aligned address and u_s1_1 equal to the word at the second aligned address, i.e., 16000 and
16008, and returns the first misaligned word of src1 as dbl_s1.

This routine can be modified to return the misaligned pixels from src2, dbl_s2, and from the control image, quad_a,
and included in a loop to load all the pixels of src1, src2, and the control image.
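The address alignment and data extraction steps can be emulated in portable C as a sketch. The names alignaddr_emul, faligndata_emul, and gsr_offset are hypothetical stand-ins, with a static variable playing the role of the offset held in the GSR, and byte arrays standing in for 64-bit registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the alignment offset held in the GSR (hypothetical). */
static unsigned gsr_offset;

/* Emulates alignaddr(): returns the 8-byte-aligned address at or below
 * addr + offset and records the low three bits as the alignment offset. */
uintptr_t alignaddr_emul(uintptr_t addr, intptr_t offset)
{
    uintptr_t sum = addr + (uintptr_t)offset;
    gsr_offset = (unsigned)(sum & 7u);
    return sum & ~(uintptr_t)7u;
}

/* Emulates faligndata(): concatenates two aligned 8-byte blocks and
 * extracts the 8 bytes starting gsr_offset bytes into the first block. */
void faligndata_emul(const uint8_t hi[8], const uint8_t lo[8], uint8_t out[8])
{
    uint8_t both[16];
    assert(gsr_offset < 8);
    memcpy(both, hi, 8);
    memcpy(both + 8, lo, 8);
    memcpy(out, both + gsr_offset, 8);
}
```

With the starting address 16005 used in the text, the emulated alignaddr returns 16000 and records an offset of 5, so the extracted word begins at the sixth byte of the first aligned block.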

Calculating the Destination Image 

1. Pixel Length of Src1, Src2, and Control Image is 1 Byte (8 bits)

In this example the pixels, α, in the control image are 8-bit unsigned integers having values between 0-255. Thus,
Eq. 1 is transformed into:

dst = α/256*src1 + (1-α/256)*src2        Eq. 2

or the destination image can be calculated from:

dst = src2 + (src1 - src2)*α/256        Eq. 3
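The step from Eq. 2 to Eq. 3 is a regrouping of terms, which can be written out as:

```latex
\begin{align*}
dst &= \frac{\alpha}{256}\,src1 + \left(1 - \frac{\alpha}{256}\right) src2 \\
    &= \frac{\alpha}{256}\,src1 + src2 - \frac{\alpha}{256}\,src2 \\
    &= src2 + (src1 - src2)\,\frac{\alpha}{256}
\end{align*}
```

The regrouped form needs one multiplication by α per pixel instead of two.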

The following routine utilizes visual instructions that provide for parallel processing of 4 pixel values per operation
with each pixel comprising 8 bits. Additionally, as will become apparent, the routine eliminates the requirement of
explicitly dividing by 256, thereby reducing the processing time and resources required to calculate the pixel values in
the destination image.

ROUTINE 2 

dbl_s1_e = fexpand(read_hi(dbl_s1));
dbl_s2_e = fexpand(read_hi(dbl_s2));
dbl_tmp2 = fsub16(dbl_s2_e, dbl_s1_e);
dbl_tmp1 = fmul8x16(read_hi(quad_a), dbl_tmp2);
dbl_sum1 = fpadd16(dbl_s1_e, dbl_tmp1);

dbl_s1_e = fexpand(read_lo(dbl_s1));
dbl_s2_e = fexpand(read_lo(dbl_s2));
dbl_tmp2 = fsub16(dbl_s2_e, dbl_s1_e);
dbl_tmp1 = fmul8x16(read_lo(quad_a), dbl_tmp2);
dbl_sum2 = fpadd16(dbl_s1_e, dbl_tmp1);
dbl_d = freg_pair(fpack16(dbl_sum1), fpack16(dbl_sum2));

The functions of the various instructions in routine 2 to calculate the pixel values in the destination image will now
be described. The variables dbl_s1, dbl_s2, and quad_a are all 8-byte words including 8 pixel values. As described
above, each byte may be a complete pixel value or a band in a multiple band pixel.

Fig. 10 depicts the modifications to the word components for each operation in the routine. As depicted in Fig. 10,
the function of fexpand(read_hi(dbl_s1)) is to expand the upper 4 bytes of dbl_s1 into a 64-bit word having four 16-bit
partitions to form dbl_s1_e. Each 16-bit partition includes 4 leading 0's, the corresponding byte from dbl_s1, and 4
trailing 0's. The variable dbl_s2_e used to calculate dbl_sum1 is similarly formed, and the variables dbl_s1_e and
dbl_s2_e used to calculate dbl_sum2 are formed by expanding the lower 4 bytes of the corresponding variables.

The function of fsub16(dbl_s2_e, dbl_s1_e) is to calculate the value (src2 - src1). This instruction performs
partitioned subtraction on the four 16-bit components of dbl_s2_e and dbl_s1_e to form dbl_tmp2.

The function of fmul8x16(read_hi(quad_a), dbl_tmp2) is to calculate the value α/256*(src2 - src1). This
instruction performs partitioned multiplication of the upper 4 bytes of quad_a and the four 16-bit components of dbl_tmp2
to form a 24-bit result and truncates the lower 8 bits to form a 16-bit result. Note that the upper 4 bits of each 16-bit
component of dbl_tmp2 are 0's and, because of the fexpand operation, the lower 4 bits of the 24-bit product are also 0's.
The middle 16 bits are the result of multiplying the byte expanded from dbl_s1 and the corresponding byte from quad_a
to form the product of α*(src2 - src1). Thus, the truncation of the lower 8 bits removes the lower 4 0's, resulting from
the previous expansion, and the lower 4 bits of the product to effect division by 256 to form dbl_tmp1 equal to
α/256*(src2 - src1).

The function of fpadd16(dbl_s1_e, dbl_tmp1) is to calculate α/256*(src2 - src1) + src1. This instruction performs
partitioned addition on the 16-bit components of its arguments.

The function of freg_pair(fpack16(dbl_sum1), fpack16(dbl_sum2)) is to return an 8-byte (64-bit) word including 8 pixel
values of the destination image. The instruction fpack16 packs each 16-bit component into a corresponding 8-bit
component by left-shifting the 16-bit component by an amount specified in the GSR register and truncating to an 8-bit
value. Thus, the left shift removes the 4 leading 0's and the lower bits are truncated to return the 8 significant bits of
the destination in each of the 4 components. The function of freg_pair is to join the two packed 32-bit variables into a
64-bit variable.
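The sequence of operations in routine 2 can be traced with a scalar emulation for four pixels. This is a sketch under stated assumptions rather than the VIS code itself: blend4_emul and floor_div are hypothetical names, the GSR scale factor is assumed to be 3, and the multiply keeps the upper 16 bits of the product without rounding. As in the routine as written, the scaled difference is added to the expanded src1 component.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division for a positive divisor, avoiding shifts of negatives. */
static int32_t floor_div(int32_t v, int32_t d)
{
    assert(d > 0);
    return v >= 0 ? v / d : -((-v + d - 1) / d);
}

/* Scalar emulation of routine 2 for four 8-bit pixels (hypothetical). */
void blend4_emul(const uint8_t s1[4], const uint8_t s2[4],
                 const uint8_t a[4], uint8_t dst[4])
{
    const int scale = 3;                       /* assumed GSR scale factor */
    for (int i = 0; i < 4; i++) {
        int32_t s1_e = (int32_t)s1[i] << 4;    /* fexpand: byte * 16       */
        int32_t s2_e = (int32_t)s2[i] << 4;
        int32_t tmp2 = s2_e - s1_e;            /* fsub16                   */
        int32_t tmp1 = floor_div((int32_t)a[i] * tmp2, 256); /* fmul8x16   */
        int32_t sum  = s1_e + tmp1;            /* fpadd16                  */
        int32_t q    = floor_div(sum * (1 << scale), 128);   /* fpack16    */
        dst[i] = (uint8_t)(q < 0 ? 0 : (q > 255 ? 255 : q));
    }
}
```

No explicit division by 256 appears in the original routine; in this emulation it corresponds to discarding the low 8 bits of the product.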

2. Pixel Length of Src1, Src2, and Control Image is 16 bits.

For 16 bits a second routine is utilized which takes into account different requirements of precision.

ROUTINE 2A

dbl_half_short = 0x80008000
dbl_mask_255 = 0x00ff00ff

/* compute (1-r)*s1 */

dbl_a = fsub16(dbl_a, dbl_half_short);
(void) alignaddr(d_aligned, seven);
dbl_tmp1 = faligndata(dbl_mask_255, dbl_a);
dbl_tmp2 = fand(dbl_tmp1, dbl_mask_255);
flt_hi = fpack16(dbl_tmp2);

dbl_tmp2 = fmul8ulx16(dbl_a, dbl_s1);
dbl_tmp1 = fmul8x16(flt_hi, dbl_s1);
dbl_sum2 = fpadd16(dbl_tmp1, dbl_tmp2);
dbl_sum1 = fpsub16(dbl_s1, dbl_sum2);

/* compute r*s2 */

dbl_tmp2 = fmul8ulx16(dbl_a, dbl_s2);
dbl_tmp1 = fmul8x16(flt_hi, dbl_s2);
dbl_sum2 = fpadd16(dbl_tmp1, dbl_tmp2);

dbl_d = fpadd16(dbl_sum1, dbl_sum2);

Except for fmul8ulx16 and fand, the operations in routine 2A are the same as in routine 2, modified to operate on
16-bit components. The description of the functions of those operations will not be repeated.

The function of fmul8ulx16(dbl_a, dbl_s2) is to perform a partitioned multiplication using the unsigned
lower 8 bits of each 16-bit component in the arguments and return the upper 16 bits of the result for each component
as 16-bit components of dbl_tmp2.

The function of fand(dbl_tmp1, dbl_mask_255) is to perform a logical AND operation on the variables defined
by the arguments.

Storing the Misaligned Destination Image Data 

1. Storing Utilizing an Edge Mask and Partial Store Instruction.

The following routine calculates an edge mask and utilizes a partial store operation to store the destination image
data, where d is a pointer to the destination location, d_aligned is the aligned address which immediately precedes d,
and width is the width of a destination word.

ROUTINE 3

d_end = d + width - 1;
emask = edge_8(d, d_end);
pst_8(dbl_d, (void *)d_aligned, emask);
++d_aligned;
emask = edge_8(d_aligned, d_end);

The function of emask is to provide a mask for storing unaligned data. For example, if the destination data is to
be written to address 0x10003, and the previous aligned address is 0x10000, then the emask will be [00011111] and
the pst instruction will start writing at address 0x10000, and emask will disable writes to 0x10000, 0x10001, and
0x10002 and enable writes to 0x10003-0x10007. Similarly, after d_aligned is incremented the last part of dbl_d is written
to 0x10008, 0x10009, and 0x1000A, and the addresses 0x1000B-0x1000F will be masked.
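The edge mask and partial store in the example above can be emulated in portable C as a sketch. The names edge8_emul and pst8_emul are hypothetical; the sketch assumes that the most significant mask bit corresponds to the byte at the aligned address, which matches the [00011111] mask given in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates the edge mask described in the text. Bit 7 of the mask stands
 * for the byte at the aligned address, bit 0 for the byte 7 above it. */
uint8_t edge8_emul(uintptr_t addr, uintptr_t end)
{
    assert(addr <= end);
    uint8_t mask = (uint8_t)(0xFFu >> (addr & 7u));          /* left edge  */
    if ((addr & ~(uintptr_t)7u) == (end & ~(uintptr_t)7u))   /* same block */
        mask &= (uint8_t)(0xFFu << (7u - (end & 7u)));       /* right edge */
    return mask;
}

/* Emulates a partial store: only bytes whose mask bit is set are written. */
void pst8_emul(const uint8_t data[8], uint8_t *aligned_dst, uint8_t mask)
{
    for (int i = 0; i < 8; i++)
        if (mask & (0x80u >> i))
            aligned_dst[i] = data[i];
}
```

For a destination running from 0x10003 to 0x1000A, the first mask enables the last five bytes of the first aligned block and the second mask enables the first three bytes of the next block.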

2.3.2 Floating Point/Graphics Unit (FGU)



The Floating-Point and Graphics Unit (FGU) as illustrated in Figure 2-4 integrates
five functional units and a 32 registers by 64 bits Register File. The floating-point
adder, multiplier and divider perform all FP operations while the graphics adder
and multiplier perform the graphics operations of the Visual Instruction Set.



[Figure: block diagram; legible labels include "5 read addresses", "Floating-Point/Graphics Register File", and "3x64"]

Figure 2-1 Floating Point and Graphics Unit

Draft October 4, 1995

Sun Microsystems, Inc.

Visual Instruction Set User's Guide

A maximum of two floating-point/graphics operations (FGops) and one FP
load/store operation are executed in every cycle (plus another integer or branch
instruction). All operations, except for divide and square-root, are fully pipelined.
Divide and square-root operations complete out-of-order without inhibiting the
concurrent execution of other FGops. The two graphics units are both fully
pipelined and perform operations on 8 or 16-bit pixel components with 16 or 32-bit
intermediate results.

The Graphics Adder performs single cycle partitioned add and subtract, data
alignment, merge, expand and logical operations. Four 16-bit adders are utilized
and a custom shifter is implemented for byte concatenation and variable
byte-length shifting. The Graphics Multiplier performs three cycle partitioned
multiplication, compare, pack and pixel distance operations. Four 8x16 multipliers are
utilized and a custom shifter is implemented. Eight 8-bit pixel subtractions,
absolute values, additions and a final alignment are required for each pixel distance
operation.



SPARC Technology Business, Draft October 4, 1995

4.3.3 vis_freg_pair()

Function

Join two vis_f32 variables into a single vis_d64 variable.

Syntax

vis_d64 vis_freg_pair(vis_f32 data1_32, vis_f32 data2_32);

Description

vis_freg_pair() joins two vis_f32 values data1_32 and data2_32 into a single
vis_d64 variable. This offers a more optimum way of performing the
equivalent of using vis_write_hi() and vis_write_lo() since the compiler
attempts to minimize the number of floating point move operations by
strategically using register pairs.

Example

vis_f32 data1_32, data2_32;
vis_d64 data_64;

/* Produces data_64, with data1_32 as the upper and data2_32 as the
lower component. */

data_64 = vis_freg_pair(data1_32, data2_32);

4.5 Pixel Compare Instructions

4.5.1 vis_fcmp[gt, le, eq, ne, lt, ge][16,32]()

Function 

Perform logical comparison berween two partitioned variables and 
generate an integer mask describing the result of the comparison. 

Syntax 
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Description 

visjcmptfgt, le, eq, neq4t,ge]0 compare four 16 bit partitioned or two 32 
bit partitioned fixed-point values within datal _4J16, dntal_2_32 and 
data2_4_lo. datal JLJ52. The 4 bit or 2 bit comparison results are returned in 
the corresponding least significant bits of a 32 bit value, that is typically 
used as a mask. A single bit is returned for each partitioned compare and 
in both cases bit zero is the least significant bit of the compare result. 

For vis_fcmpgt(), each bit within the 4 bit or 2 bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is greater than the
corresponding value of [data2_4_16, data2_2_32].

For vis_fcmple(), each bit within the 4 bit or 2 bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is less than or equal to
the corresponding value of [data2_4_16, data2_2_32].

For vis_fcmpeq(), each bit within the 4 bit or 2 bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is equal to the
corresponding value of [data2_4_16, data2_2_32].

For vis_fcmpne(), each bit within the 4 bit or 2 bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is not equal to the
corresponding value of [data2_4_16, data2_2_32].

For vis_fcmplt(), each bit within the 4 bit or 2 bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is less than the
corresponding value of [data2_4_16, data2_2_32].

For vis_fcmpge(), each bit within the 4 bit or 2 bit compare result is set if
the corresponding value of [data1_4_16, data1_2_32] is greater than or equal to
the corresponding value of [data2_4_16, data2_2_32].

The four 16 bit pixel comparison operations are illustrated in Figure 4-4
and the two 32 bit pixel comparison operations are illustrated in Figure 4-5.

Figure 4-4 Four 16 bit Pixel Comparison Operations
Figure 4-5 Two 32 bit Pixel Comparison Operations

Example

int mask;

vis_d64 data1_4_16, data2_4_16, data1_2_32, data2_2_32;

mask = vis_fcmpgt16(data1_4_16, data2_4_16);
/* data1_4_16 > data2_4_16 */

mask = vis_fcmple16(data1_4_16, data2_4_16);
/* data1_4_16 <= data2_4_16 */

mask = vis_fcmpge16(data1_4_16, data2_4_16);
/* data1_4_16 >= data2_4_16 */

mask = vis_fcmpeq16(data1_4_16, data2_4_16);
/* data1_4_16 == data2_4_16 */

mask = vis_fcmpne16(data1_4_16, data2_4_16);
/* data1_4_16 != data2_4_16 */

mask = vis_fcmplt16(data1_4_16, data2_4_16);
/* data1_4_16 < data2_4_16 */

mask = vis_fcmpgt16(data1_4_16, data2_4_16);
/* data1_4_16 > data2_4_16 */

/* mask may be used as an argument to a partial store instruction
vis_pst_8, vis_pst_16 or vis_pst_32 */
vis_pst_16(data1_4_16, &data2_4_16, mask);

/* Stores the greater of data1_4_16 or data2_4_16, overwriting
data2_4_16 */
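
The compare-then-partial-store idiom can be modeled in portable C to make the mask semantics concrete. This is a sketch, not the VIS implementation: the names emu_fcmpgt16 and emu_pst_16 are hypothetical, vis_d64 is modeled as a uint64_t, and it is assumed that the least significant 16 bit lane maps to bit 0 of the mask.

```c
#include <stdint.h>

/* Hypothetical emulation of a four-lane 16 bit "greater than" compare.
   Assumption: lane 0 is the least significant 16 bits of the 64 bit
   word and maps to bit 0 of the returned mask. */
static int emu_fcmpgt16(uint64_t a, uint64_t b)
{
    int mask = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int16_t x = (int16_t)(uint16_t)(a >> (16 * lane));
        int16_t y = (int16_t)(uint16_t)(b >> (16 * lane));
        if (x > y)
            mask |= 1 << lane;
    }
    return mask;
}

/* Hypothetical emulation of a 16 bit partial store: only lanes whose
   mask bit is set are copied from src into *dst. */
static void emu_pst_16(uint64_t src, uint64_t *dst, int mask)
{
    for (int lane = 0; lane < 4; ++lane) {
        if (mask & (1 << lane)) {
            uint64_t lane_bits = (uint64_t)0xFFFF << (16 * lane);
            *dst = (*dst & ~lane_bits) | (src & lane_bits);
        }
    }
}
```

After the partial store, each lane of the destination holds the larger of the two inputs, which is the "store the greater" behavior of the example above.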

4.6 Arithmetic Instructions

The VIS arithmetic instructions perform partitioned addition, subtraction or
multiplication.

4.6.1 vis_fpadd[16, 16s, 32, 32s](), vis_fpsub[16, 16s, 32, 32s]()

Function

Perform addition and subtraction on two 16 bit, four 16 bit or two 32 bit
partitioned data.

Syntax: 

vis_d64 vis_fpadd16(vis_d64 data1_4_16, vis_d64 data2_4_16);
vis_d64 vis_fpsub16(vis_d64 data1_4_16, vis_d64 data2_4_16);
vis_d64 vis_fpadd32(vis_d64 data1_2_32, vis_d64 data2_2_32);
vis_d64 vis_fpsub32(vis_d64 data1_2_32, vis_d64 data2_2_32);
vis_f32 vis_fpadd16s(vis_f32 data1_2_16, vis_f32 data2_2_16);
vis_f32 vis_fpsub16s(vis_f32 data1_2_16, vis_f32 data2_2_16);
vis_f32 vis_fpadd32s(vis_f32 data1_1_32, vis_f32 data2_1_32);
vis_f32 vis_fpsub32s(vis_f32 data1_1_32, vis_f32 data2_1_32);

Description 

vis_fpadd16() and vis_fpsub16() perform partitioned addition and
subtraction between two 64 bit partitioned variables, interpreted as four 16
bit signed components, data1_4_16 and data2_4_16, and return a 64 bit
partitioned variable interpreted as four 16 bit signed components, sum_4_16
or difference_4_16. vis_fpadd32() and vis_fpsub32() perform partitioned
addition and subtraction between two 64 bit partitioned components,
interpreted as two 32 bit signed variables, data1_2_32 and data2_2_32, and
return a 64 bit partitioned variable interpreted as two 32 bit components,
sum_2_32 or difference_2_32. Overflow and underflow are not detected and
result in wraparound. Figure 4-6 illustrates the vis_fpadd16() and
vis_fpsub16() operations. Figure 4-7 illustrates the vis_fpadd32() and
vis_fpsub32() operations. The 32 bit versions interpret their arguments as
two 16 bit signed values or one 32 bit signed value.

The single precision versions of these instructions, vis_fpadd16s(),
vis_fpsub16s(), vis_fpadd32s() and vis_fpsub32s(), perform two 16-bit or one
32-bit partitioned adds or subtracts. Figure 4-8 illustrates the
vis_fpadd16s() and vis_fpsub16s() operations and Figure 4-9 illustrates the
vis_fpadd32s() and vis_fpsub32s() operations.
Figure 4-6 vis_fpadd16() and vis_fpsub16() operation
Figure 4-7 vis_fpadd32() and vis_fpsub32() operation
Figure 4-8 vis_fpadd16s() and vis_fpsub16s() operation
Figure 4-9 vis_fpadd32s() and vis_fpsub32s() operation

Example

vis_d64 data1_4_16, data2_4_16, data1_2_32, data2_2_32;
vis_d64 sum_4_16, difference_4_16, sum_2_32, difference_2_32;
vis_f32 data1_2_16, data2_2_16, sum_2_16, difference_2_16;
vis_f32 data1_1_32, data2_1_32, sum_1_32, difference_1_32;

sum_4_16 = vis_fpadd16(data1_4_16, data2_4_16);
difference_4_16 = vis_fpsub16(data1_4_16, data2_4_16);
sum_2_32 = vis_fpadd32(data1_2_32, data2_2_32);
difference_2_32 = vis_fpsub32(data1_2_32, data2_2_32);
sum_2_16 = vis_fpadd16s(data1_2_16, data2_2_16);
difference_2_16 = vis_fpsub16s(data1_2_16, data2_2_16);
sum_1_32 = vis_fpadd32s(data1_1_32, data2_1_32);
difference_1_32 = vis_fpsub32s(data1_1_32, data2_1_32);
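
The wraparound behavior of a partitioned add can be illustrated in portable C. This is a minimal sketch assuming a vis_d64 is modeled as a uint64_t; the name emu_fpadd16 is hypothetical. The point it shows is that each 16 bit lane wraps independently and no carry crosses a lane boundary.

```c
#include <stdint.h>

/* Hypothetical emulation of a four-lane 16 bit partitioned add.
   Each lane wraps modulo 2^16; carries never cross lanes. */
static uint64_t emu_fpadd16(uint64_t a, uint64_t b)
{
    uint64_t sum = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        uint16_t s = (uint16_t)(x + y);   /* wraps modulo 2^16 */
        sum |= (uint64_t)s << (16 * lane);
    }
    return sum;
}
```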

4.6.2 vis_fmul8x16()

Function:

Multiply the elements of an 8 bit partitioned vis_f32 variable by the
corresponding element of a 16 bit partitioned vis_d64 variable to produce a
16 bit partitioned vis_d64 result.

Syntax:

vis_d64 vis_fmul8x16(vis_f32 pixels, vis_d64 scale);
Description

vis_fmul8x16() multiplies each unsigned 8-bit component within pixels by
the corresponding signed 16-bit fixed-point component within scale and
returns the upper 16 bits of the 24 bit product (after rounding) as a signed
16-bit component in the 64 bit returned value. In other words:

16 bit result = (8 bit pixel element * 16 bit scale element + 128) / 256

The operation is illustrated in Figure 4-10.

This instruction treats the pixels values as fixed-point with the binary point
to the left of the most significant bit. For example, this operation is used
with filter coefficients as the fixed-point scale value, and image data as the
pixels value.

SPARC Technology Business Draft October 4, 1995

EP 0 790 581 A2

[Figure: pixels and scale operands combined to produce result]

Figure 4-10 vis_fmul8x16() Operation

Example 

vis_f32 pixels;
vis_d64 result, scale;

result = vis_fmul8x16(pixels, scale);
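
The formula given in the description can be checked with a portable C model. The sketch below is hypothetical (emu_fmul8x16 and emu_floor_div256 are not VIS names). It assumes that the most significant byte of pixels pairs with the most significant 16 bits of scale, and that "upper 16 bits after rounding" means adding 128 and then rounding toward negative infinity.

```c
#include <stdint.h>

/* Floor division by 256 that does not rely on shifting negative values. */
static int32_t emu_floor_div256(int32_t v)
{
    return (v >= 0) ? (v / 256) : -((-v + 255) / 256);
}

/* Hypothetical emulation of vis_fmul8x16(): pixels holds four unsigned
   8 bit lanes, scale holds four signed 16 bit lanes. */
static uint64_t emu_fmul8x16(uint32_t pixels, uint64_t scale)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int32_t p = (int32_t)((pixels >> (8 * lane)) & 0xFF);
        int32_t s = (int16_t)(uint16_t)(scale >> (16 * lane));
        int32_t r = emu_floor_div256(p * s + 128);   /* (p * s + 128) / 256 */
        result |= (uint64_t)(uint16_t)r << (16 * lane);
    }
    return result;
}
```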
4.6.4 vis_fmul8sux16(), vis_fmul8ulx16()

Function

Multiply the corresponding elements of two 16 bit partitioned vis_d64
variables to produce a 16 bit partitioned vis_d64 result.

Syntax

vis_d64 vis_fmul8sux16(vis_d64 data1_16, vis_d64 data2_16);
vis_d64 vis_fmul8ulx16(vis_d64 data1_16, vis_d64 data2_16);

Description 

Both vis_fmul8sux16() and vis_fmul8ulx16() perform "half" a
multiplication. fmul8sux16() multiplies the upper 8 bits of each 16-bit
signed component of data1_4_16 by the corresponding 16-bit fixed point
signed component in data2_4_16. The upper 16 bits of the 24-bit product
are returned in a 16-bit partitioned resultu. The 24 bit product is rounded to
16 bits. The operation is illustrated in Figure 4-13.

vis_fmul8ulx16() multiplies the unsigned lower 8 bits of each 16-bit
element of data1_4_16 by the corresponding 16 bit element in data2_4_16.
Each 24-bit product is sign-extended to 32 bits. The upper 16 bits of the
sign extended value are returned in a 16-bit partitioned resultl. The
operation is illustrated in Figure 4-14.

Because the result of fmul8ulx16() is conceptually shifted right 8 bits
relative to the result of fmul8sux16(), they have the proper relative
significance to be added together to yield a 16 bit product of data1_4_16
and data2_4_16.

Each of the "partitioned multiplications" in this composite operation
multiplies two 16-bit fixed point numbers to yield a 16-bit result, i.e. the
lower 16 bits of the full precision 32-bit result are dropped after rounding.
The location of the binary point in the fixed point arguments is under
the user's control. It can be anywhere from the right of bit 0 to the left of
bit 15.

For example, each of the input arguments can have 8 fractional bits, i.e. the
binary point between bit 7 and bit 8. If a full precision 32-bit result were
provided, it would have 16 fractional bits, i.e. the binary point would be
between bits 15 and 16. Since, however, only 16 bits of the result are
provided, the lower 16 fractional bits are dropped after rounding. The
binary point of the 16-bit result in this case is to the right of bit 0.

Another example, illustrated below, has 12 fractional bits in each of its 2
component arguments, i.e. the binary point is between bits 11 and 12. A
full precision 32-bit result would have 24 fractional bits, i.e. the binary
point between bits 23 and 24. Since, however, only a 16-bit result is
provided, the lower 16 fractional bits are dropped after rounding, thus
providing a result with 8 fractional bits, i.e. the binary point between bits
7 and 8.



  0101.001010010101 (= 5.161376953125)
x 0001.011001001001 (= 1.392822265625)

  00000111.00110000 (= 7.188880741596)
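
The worked example above can be reproduced with ordinary integer arithmetic. The sketch below is not VIS code: the helper name fixed_mul_4_12 is hypothetical and nonnegative inputs are assumed so that the right shift is fully portable. It multiplies two values with 12 fractional bits and keeps the upper 16 bits of the 32 bit product after rounding, leaving 8 fractional bits.

```c
#include <stdint.h>

/* Multiply two nonnegative fixed-point values that each carry 12
   fractional bits and keep the upper 16 bits of the 32 bit product
   after rounding. The result carries 8 fractional bits. */
static uint16_t fixed_mul_4_12(uint16_t a, uint16_t b)
{
    uint32_t product = (uint32_t)a * (uint32_t)b;   /* 24 fractional bits */
    return (uint16_t)((product + 0x8000u) >> 16);   /* 8 fractional bits  */
}
```

With the operands of the example, 0x5295 (0101.001010010101) and 0x1649 (0001.011001001001), the function returns 0x0730, which is the bit pattern 00000111.00110000 shown above.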
Figure 4-13 vis_fmul8sux16() operation

Figure 4-14 vis_fmul8ulx16() operation

Example 

vis_d64 data1_4_16, data2_4_16, resultl, resultu, result;

resultu = vis_fmul8sux16(data1_4_16, data2_4_16);
resultl = vis_fmul8ulx16(data1_4_16, data2_4_16);

result = vis_fpadd16(resultu, resultl); /* 16 bit result of a 16 x 16
multiply */
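
How the two "half" products combine can be modeled for a single 16 bit lane in portable C. This is a sketch under stated assumptions, not a bit-exact model of the hardware: the names emu_floor_shift and emu_mul16_lane are hypothetical, and because the description above mentions rounding only for the upper-byte product, no rounding is applied to the lower-byte product here. The sum may therefore differ from the exactly rounded product by one least significant bit.

```c
#include <stdint.h>

/* Floor of v / 2^n without shifting a negative value. */
static int32_t emu_floor_shift(int32_t v, int n)
{
    int32_t d = (int32_t)1 << n;
    return (v >= 0) ? (v / d) : -((-v + d - 1) / d);
}

/* One 16 bit lane of the composite multiply, following the text:
   "su" uses the signed upper byte of a (upper 16 of 24 bits, rounded);
   "ul" uses the unsigned lower byte of a (upper 16 bits of the
   sign-extended 32 bit product, assumed unrounded). */
static int16_t emu_mul16_lane(int16_t a, int16_t b)
{
    int32_t hi = emu_floor_shift(a, 8);     /* signed upper 8 bits   */
    int32_t lo = a - hi * 256;              /* unsigned lower 8 bits */
    int32_t su = emu_floor_shift(hi * b + 128, 8);
    int32_t ul = emu_floor_shift(lo * b, 16);
    return (int16_t)(uint16_t)(su + ul);
}
```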
4.7.1 vis_fpack16()

Function

Truncates four 16 bit signed components to four 8 bit unsigned
components.

Syntax

vis_f32 vis_fpack16(vis_d64 data_4_16);
Description

vis_fpack16() takes four 16-bit fixed components within data_4_16, scales,
truncates and clips them into four 8-bit unsigned components and returns
a vis_f32 result. This is accomplished by left shifting the 16 bit component
as determined from the scale factor field of GSR and truncating to an 8-bit
unsigned integer by rounding and then discarding the least significant
digits. If the resulting value is negative (i.e., the MSB is set), zero is
returned. If the value is greater than 255, then 255 is returned. Otherwise
the scaled value is returned. For an illustration of this operation see
Figure 4-17.
Figure 4-17 vis_fpack16() operation

Example

vis_d64 data_4_16;
vis_f32 pixels;

pixels = vis_fpack16(data_4_16);
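
The scale, truncate and clip sequence can be sketched in portable C. The names emu_fpack16_lane and emu_fpack16 are hypothetical. One detail is an assumption that the text above does not state: after the left shift by the scale factor, the binary point is taken to lie between bits 7 and 6, so the 8 bit result comes from bits 14..7 of the shifted value. The sketch truncates rather than rounds.

```c
#include <stdint.h>

/* Hypothetical emulation of one lane of vis_fpack16().
   Assumption: the 8 bit result is taken from bits 14..7 of the
   shifted value. */
static uint8_t emu_fpack16_lane(int16_t value, int scale_factor)
{
    int32_t shifted = (int32_t)value * ((int32_t)1 << scale_factor);
    if (shifted < 0)
        return 0;                 /* negative values clip to 0 */
    shifted >>= 7;                /* drop the fractional bits  */
    return (shifted > 255) ? 255 : (uint8_t)shifted;
}

/* Packs four 16 bit lanes into four bytes, preserving lane order. */
static uint32_t emu_fpack16(uint64_t data_4_16, int scale_factor)
{
    uint32_t pixels = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int16_t v = (int16_t)(uint16_t)(data_4_16 >> (16 * lane));
        pixels |= (uint32_t)emu_fpack16_lane(v, scale_factor) << (8 * lane);
    }
    return pixels;
}
```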
4.7.2 vis_fpack32()

Function

Truncate two 32 bit fixed values into two unsigned 8 bit integers.
Syntax

vis_d64 vis_fpack32(vis_d64 data_2_32, vis_d64 pixels);
Description

vis_fpack32() copies its second argument, pixels, shifted left by 8 bits into
the destination or vis_d64 return value. It then extracts two 8 bit quantities,
one each from the two 32-bit fixed values within data_2_32, and overwrites
the least significant byte position of each 32 bit half of the destination.
Two pixels consisting of four 8 bit bytes each may be assembled by repeated
operation of vis_fpack32() on four data_2_32 pairs.

The reduction of data_2_32 from 32 to 8 bits is controlled by the scale factor
of the GSR. The initial 32-bit value is shifted left by the
GSR.scale_factor, and the result is considered as a fixed-point number with
its binary point between bits 22 and 23. If this number is negative, the
output is clamped to 0; if greater than 255, it is clamped to 255. Otherwise,
the eight bits to the left of the binary point are taken as the output.

Another way to conceptualize this process is to think of the binary point as
lying to the left of bit (22 - scale factor), i.e., (23 - scale factor) bits of
fractional precision. The 4-bit scale factor can take any value between 0 and
15 inclusive. This means that 32-bit partitioned variables which are to be
packed using vis_fpack32() may have between 8 and 23 fractional bits.

The following code example takes four variables red, green, blue, and
alpha, each containing data for two pixels in 32-bit partitioned format
(r0r1, g0g1, b0b1, a0a1), and produces a vis_d64 pixels containing eight
8 bit quantities (r0g0b0a0r1g1b1a1).

vis_d64 red, green, blue, alpha, pixels;

/* red, green, blue, and alpha contain data for 2 pixels */

pixels = vis_fpack32(red, pixels);
pixels = vis_fpack32(green, pixels);
pixels = vis_fpack32(blue, pixels);
pixels = vis_fpack32(alpha, pixels);

/* The result is two sets of red, green, blue and alpha values packed
in pixels */
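
The shift-and-insert behavior described above can be modeled in portable C. The following is a sketch, not the VIS implementation: emu_fpack32_lane and emu_fpack32 are hypothetical names, the GSR scale factor is passed as an explicit parameter, and each 32 bit half of pixels is assumed to shift independently.

```c
#include <stdint.h>

/* One 32 bit lane: shift left by scale_factor, place the binary point
   between bits 22 and 23, clamp the integer part to 0..255. */
static uint8_t emu_fpack32_lane(int32_t value, int scale_factor)
{
    int64_t shifted = (int64_t)value * ((int64_t)1 << scale_factor);
    if (shifted < 0)
        return 0;
    shifted >>= 23;
    return (shifted > 255) ? 255 : (uint8_t)shifted;
}

/* Hypothetical emulation of vis_fpack32(): each 32 bit half of pixels
   moves left one byte and receives the newly packed byte at the bottom. */
static uint64_t emu_fpack32(uint64_t data_2_32, uint64_t pixels,
                            int scale_factor)
{
    uint8_t hi = emu_fpack32_lane((int32_t)(uint32_t)(data_2_32 >> 32),
                                  scale_factor);
    uint8_t lo = emu_fpack32_lane((int32_t)(uint32_t)data_2_32,
                                  scale_factor);
    uint64_t shifted = (pixels << 8) & 0xFFFFFF00FFFFFF00ull;
    return shifted | ((uint64_t)hi << 32) | (uint64_t)lo;
}
```

Calling emu_fpack32() four times, once each for red, green, blue and alpha, assembles the r0g0b0a0r1g1b1a1 layout described above.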
Figure 4-18 vis_fpack32() operation

4.7.3 vis_fpackfix()

Function

Converts two 32 bit partitioned data to two 16 bit partitioned data.
Syntax

vis_f32 vis_fpackfix(vis_d64 data_2_32);
Description

vis_fpackfix() takes two 32-bit fixed components within data_2_32, scales,
and truncates them into two 16-bit signed components. This is
accomplished by shifting each 32 bit component of data_2_32 according to
GSR.scale_factor and then truncating to a 16 bit scaled value starting
between bits 16 and 15 of each 32 bit word. Truncation converts the scaled
value to a signed integer (i.e. rounds toward negative infinity). If the value
is less than -32768, -32768 is returned. If the value is greater than 32767,
32767 is returned. Otherwise the scaled data_2_16 value is returned.
Figure 4-19 illustrates the vis_fpackfix() operation.



Example

vis_d64 data_2_32;
vis_f32 data_2_16;

data_2_16 = vis_fpackfix(data_2_32);
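
The scaling, truncation and saturation steps described above can be sketched for one 32 bit lane in portable C. The name emu_fpackfix_lane is hypothetical and the GSR scale factor is passed as an explicit parameter; this is an illustration of the description, not a verified model of the hardware.

```c
#include <stdint.h>

/* One 32 bit lane of vis_fpackfix() as described above: scale, keep the
   bits above bit 15 (rounding toward negative infinity), then clamp. */
static int16_t emu_fpackfix_lane(int32_t value, int scale_factor)
{
    int64_t shifted = (int64_t)value * ((int64_t)1 << scale_factor);
    int64_t unit = (int64_t)1 << 16;
    int64_t q = (shifted >= 0) ? (shifted / unit)
                               : -((-shifted + unit - 1) / unit);
    if (q < -32768) return -32768;
    if (q > 32767)  return 32767;
    return (int16_t)q;
}
```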
Figure 4-19 vis_fpackfix() operation

4.7.4 vis_fexpand()

Function

Converts four unsigned 8 bit elements to four 16 bit fixed elements.
Syntax

vis_d64 vis_fexpand(vis_f32 data_4_8);
Description

vis_fexpand() converts packed format data, e.g. raw pixel data, to a
partitioned format. vis_fexpand() takes four 8-bit unsigned elements
within data_4_8, converts each integer to a 16-bit fixed value by inserting
four zeroes to the right and to the left of each byte, and returns four 16-bit
elements within a 64 bit result. Since the various vis_fmul8x16()
instructions can also perform this function, vis_fexpand() is mainly used
when the first operation to be used on the expanded data is an addition or
a comparison. Figure 4-20 illustrates the vis_fexpand() operation.
Figure 4-20 vis_fexpand() operation
Example

vis_d64 data_4_16, result_4_16;
vis_f32 data_4_8, factor;

result_4_16 = vis_fexpand(data_4_8);

/* Using vis_fmul8x16al to perform the same function */
factor = vis_to_float(0x0010);

result_4_16 = vis_fmul8x16al(data_4_8, factor);
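
The expansion rule (four zero bits on each side of every byte) is easy to verify with a portable C model. This is a sketch with a hypothetical name, emu_fexpand, in which vis_f32 and vis_d64 are modeled as raw 32 bit and 64 bit patterns.

```c
#include <stdint.h>

/* Hypothetical emulation of vis_fexpand(): each unsigned byte is placed
   in bits 11..4 of its 16 bit lane, leaving four zero bits on each side. */
static uint64_t emu_fexpand(uint32_t data_4_8)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint64_t byte = (data_4_8 >> (8 * lane)) & 0xFF;
        result |= (byte << 4) << (16 * lane);
    }
    return result;
}
```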
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4J.7 vis_alignaddr(), vis_faligndata() 

Function 

Calculate 8 byte aligned address and extra a an arbitrary ft bytes from two 
3 byte aligned addresses. 

Syntax 

void • vi3_aliqrnaddrivoid -addr, offset) ; 

vis_d64 via_f ilicndaca ivxs_d64 daca_fci. via_d64 daca_ic) ; 

Description 

vis_alignaddr() and vis_faligndata() are usually used together.
vis_alignaddr() takes an arbitrarily aligned pointer addr and a signed
integer offset, adds them, places the rightmost three bits of the result in the
address offset field of the GSR and returns the result with the rightmost 3
bits set to 0. This return value can then be used as an 8 byte aligned
address for loading or storing a vis_d64 variable. An example is shown in
Figure 4-22.

vis_alignaddr(x10005, 0) returns x10000 with 5 placed in the GSR offset field.
vis_alignaddr(x10005, -2) returns x10000 with 3 placed in the GSR offset field.

Figure 4-22 vis_alignaddr() example.

vis_faligndata() takes two vis_d64 arguments data_hi and data_lo. It
concatenates these two 64 bit values as data_hi, which is the upper half of
the concatenated value, and data_lo, which is the lower half of the
concatenated value. Bytes in this value are numbered from most significant
to the least significant with the most significant byte being 0. The return
value is a vis_d64 variable representing eight bytes extracted from the
concatenated value with the most significant byte specified by the GSR
offset field as illustrated in Figure 4-23.
x10000    x10008

x10005    x1000C

vis_faligndata(data_hi, data_lo) returns required data segment

Figure 4-23 vis_faligndata() example.
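
The interaction between the two functions can be modeled in portable C. This is only a sketch: emu_alignaddr, emu_faligndata and the global emu_gsr_align are hypothetical stand-ins, addresses are modeled as plain integers, and the GSR offset field is modeled as an ordinary variable.

```c
#include <stdint.h>

/* Hypothetical stand-in for the GSR address offset field. */
static unsigned emu_gsr_align;

/* Adds offset to addr, records the low three bits, and returns the sum
   with those bits cleared. */
static uint64_t emu_alignaddr(uint64_t addr, int64_t offset)
{
    uint64_t sum = addr + (uint64_t)offset;
    emu_gsr_align = (unsigned)(sum & 7);
    return sum & ~(uint64_t)7;
}

/* Extracts eight bytes from the 16 byte value data_hi:data_lo, starting
   at byte emu_gsr_align (byte 0 is the most significant byte). */
static uint64_t emu_faligndata(uint64_t data_hi, uint64_t data_lo)
{
    unsigned shift = 8 * emu_gsr_align;
    if (shift == 0)
        return data_hi;           /* avoids an undefined 64 bit shift */
    return (data_hi << shift) | (data_lo >> (64 - shift));
}
```

With the addresses from Figure 4-22, emu_alignaddr(0x10005, 0) returns 0x10000 and records an offset of 5, so a following emu_faligndata() call returns bytes 5 through 12 of the concatenated value.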

Care must be taken not to read past the end of a legal segment of memory.
A legal segment can only begin and end on page boundaries, and so if any
byte of a vis_d64 lies within a valid page, the entire vis_d64 must lie within
the page. However, when addr is already 8 byte aligned, the GSR alignment
bits will be set to 0 and no byte of data_lo will be used. Therefore even
though it is legal to read 8 bytes starting at addr, it may not be legal to read
16 bytes and this code will fail. This problem may be avoided in a number
of ways:

addr may be compared with some known address of the last legal byte;

the final iteration of a loop, which may need to read past the end of the legal
data, may be special-cased;

slightly more memory than needed may be allocated to ensure that there are
valid bytes available after the end of the data.



Example 



The following example illustrates how these instructions may be used
together to read a group of eight bytes from an arbitrarily-aligned address
'addr', as follows:

void *addr;
vis_d64 *addr_aligned;
vis_d64 data_hi, data_lo, data;

addr_aligned = vis_alignaddr(addr, 0);
data_hi = addr_aligned[0];
data_lo = addr_aligned[1];
data = vis_faligndata(data_hi, data_lo);

When data are being accessed in a stream, it is not necessary to perform all
the steps shown above for each vis_d64. Instead, the address may be
aligned once and only one new vis_d64 read per iteration:

addr_aligned = vis_alignaddr(addr, 0);
data_hi = addr_aligned[0];
data_lo = addr_aligned[1];
for (i = 0; i < times; ++i) {
    data = vis_faligndata(data_hi, data_lo);
    /* Use data here. */

    /* Move data "window" to the right. */
    data_hi = data_lo;
    data_lo = addr_aligned[i + 2];
}

Of course, the same considerations concerning read ahead apply here. In
general, it is best not to use vis_alignaddr() to generate an address within
an inner loop, e.g.,

{
    addr_aligned = vis_alignaddr(addr, offset);
    data_hi = addr_aligned[0];
    offset += 8;
    /* ... */
}

This means that the data cannot be read until the new address has
been computed. Instead, compute the aligned address once and either
increment it directly or use array notation. This will ensure that the address
arithmetic is performed in the integer units in parallel with the execution of
the VIS instructions.
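
The streaming pattern and its read-ahead requirement can be demonstrated without VIS hardware. The sketch below is a portable C model, not the guide's code: load_be64 and stream_unaligned are hypothetical names, buf is assumed to be the 8 byte aligned base, and the caller is assumed to have over-allocated the buffer (the third remedy listed earlier) so that the extra block read in the last iteration is legal.

```c
#include <stdint.h>
#include <stddef.h>

/* Loads eight bytes as a big-endian 64 bit value (byte 0 most significant). */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v = (v << 8) | p[i];
    return v;
}

/* Streams `times` unaligned 8 byte groups starting at buf + start while
   reading only whole aligned blocks. Like the loop above, it reads one
   block ahead, so (start / 8 + times + 2) * 8 bytes must be readable. */
static void stream_unaligned(const uint8_t *buf, size_t start,
                             size_t times, uint64_t *out)
{
    const uint8_t *aligned = buf + (start & ~(size_t)7);
    unsigned shift = 8 * (unsigned)(start & 7);
    uint64_t hi = load_be64(aligned);
    uint64_t lo = load_be64(aligned + 8);

    for (size_t i = 0; i < times; ++i) {
        out[i] = shift ? (hi << shift) | (lo >> (64 - shift)) : hi;
        hi = lo;                                   /* slide the window */
        lo = load_be64(aligned + 8 * (i + 2));     /* one new read     */
    }
}
```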
Claims

1. In a computer system, a method of alpha blending images, comprising the steps of:

loading a first word, comprising a plurality of word components into a processor in parallel, each word
component associated with a source1 pixel of a first source image;

loading a second word comprising plurality of word components into a processor in parallel, each word com- 
ponent associated with a source2 pixel of a second source image; 

loading a third word comprising plurality of word components into a processor in parallel, each word component 
associated with a control pixel of a control image; 

alpha blending the components of said first, second, and third words in parallel to generate word components
of a fourth word, with the word components of said fourth word associated with the destination pixels of an alpha
blended destination word;

storing the word components of said fourth word to an unaligned area of a memory in parallel. 

2. The method of claim 1 wherein:

said step of alpha blending comprises the step of arithmetically combining corresponding source1, source2,
and control pixels according to a predetermined formula to generate a corresponding destination pixel.

3. The method of claim 2 further comprising:

specifying a precision value for each of said source1, source2, control, and destination pixels;

reordering operations and terms of said predetermined formula to achieve the precision value and increase 

efficiency of said alpha blending step. 

4. The method of claim 2 wherein:

said step of arithmetically combining utilizes predefined partitioned add, multiply, and subtract operations to 
operate on components of said first, second, and third words in parallel; 

reordering operations and terms of said predetermined formula to increase the efficiency of operation of said 
predetermined partitioned operations. 

5. In a computer system, a method of blending first and second source images to generate a destination image
utilizing a control image where any one of said images is an unaligned image stored in a memory having boundaries
unaligned with the addresses of said memory and where said images comprise words including multiple pixels, 
and with each pixel in said control image comprising a control pixel number of bits, said method comprising the 
steps of: 

loading a first word from said first and second source images and said control image, and, if one of said images 
is an unaligned image; 

generating an aligned address immediately preceding an unaligned address of a first word in said unaligned 
image; 

calculating an offset being the difference between said aligned address and said unaligned address; 

utilizing said unaligned address and said offset to load a word from said unaligned image; 

expanding a subset of the pixels in the first word of said first and second source images into expanded pixels 

having equal numbers of leading and trailing zeros to form an expanded first word including said expanded 

pixels; 

performing a partitioned subtraction operation to subtract corresponding expanded pixels in said first expanded 
words of said first and second source images to form an expanded difference word including expanded dif- 
ference components; 

performing a partitioned multiplication of corresponding pixels in said first word of said control image and said 
corresponding expanded difference components to form an expanded product word comprising expanded 
product components, with each expanded product component including the same number of leading zeros as 
said expanded pixel and having said control pixel number of least significant bits truncated to effect division 
by 2 raised to the power of said control pixel number; 

performing a partitioned sum of said expanded product word and said first expanded word first source image
to form a first expanded halfword of said destination image comprising expanded destination components;
packing said destination components of said expanded halfword to form a subset of the pixels of said
destination word.

6. In a computer system, a method of blending first and second source images to generate a destination image
utilizing a control image where any one of said images is an unaligned image stored in a memory having boundaries
unaligned with the addresses of said memory and where said images comprise words including multiple pixels, 
and with each pixel in said control image comprising a control pixel number of bits, said method comprising the 
steps of: 

loading a first word from said first and second source images and said control image, and, if one of said images 
is an unaligned image; 

generating an aligned address immediately preceding an unaligned address of a first word in said unaligned 
image; 

calculating an offset being the difference between said aligned address and said unaligned address; 
utilizing said unaligned address and said offset to load a word from said unaligned image; 
defining first and second constants equal to 0x80008000 and 0x00ff00ff respectively;
performing a partitioned subtraction operation to subtract corresponding components of said first constant
from said components in said first word of said control image in parallel to form a first difference word;
returning an aligned data word comprising components of said second constant and components of said first
difference word; 

performing a logical AND operation of corresponding components of said aligned data word and said second 
constant to generate a logical result word; 

performing a partitioned packing operation of said logical result word to form a packed logical result word; 
performing a partitioned multiplication of said first word of said control image and said first word of said first 
source image to form a first resulting product word; 

performing a partitioned multiplication of said packed logical result word said first word of said first source 
image to form a second resulting product word; 

performing a partitioned add operation on said first and second resulting product words to form a first sum word; 
performing a partitioned multiplication of said first word of said control image and said first word of said second 
source image to form a third resulting product word; 

performing a partitioned multiplication of said packed logical result word said first word of said second source 
image to form a fourth resulting product word; 

performing a partitioned add operation on said third and fourth resulting product words to form a second sum 
word; 

performing a partitioned add operation of said first and second sum words to form said first destination word; 
computing an edge mask to store said destination word to said unaligned destination image; 
utilizing said edge mask to perform a partial store of said destination word to said unaligned destination image. 

7. An apparatus for forming a composite image by blending two digital images, the apparatus characterised by:

means for loading a first word into a data processor, the first word comprising a plurality of word components,
each word component associated with a source 1 pixel of a first source image;

means for loading a second word into the data processor, the second word comprising plurality of word
components, each word component associated with a source 2 pixel of a second source image;

means for loading a third word into the data processor, the third word comprising plurality of word components,
each word component associated with a control pixel of a control image;

means for blending the components of said first, second, and third words in parallel to generate word
components of a fourth word, with the word components of said fourth word associated with the destination pixels of
a blended destination word; and

means for storing the word components of said fourth word to an unaligned area of a memory in parallel.

8. An apparatus as claimed in claim 9, characterised in that the composite image is a destination image and any one
of said images is an unaligned image stored in a memory having boundaries unaligned with the addresses of said
memory and where said images comprise words including multiple pixels, and with each pixel in said control image
comprising a control pixel number of bits, and wherein if one of said images is an unaligned image the apparatus
is further arranged to generate an aligned address immediately preceding an unaligned address of a first word in
said unaligned image to calculate an offset being the difference between said aligned address and said unaligned
address; and to utilize said unaligned address and said offset to load a word from said unaligned image.

9. An apparatus as claimed in claims 9 and 10, characterised by being further arranged to expand a subset of the
pixels in the first word of said first and second source images into expanded pixels having equal numbers of leading
and trailing zeros to form an expanded first word including said expanded pixels;
to perform a partitioned subtraction operation to subtract corresponding expanded pixels in said first expanded
words of said first and second source images to form an expanded difference word including expanded
difference components;

to perform a partitioned multiplication of corresponding pixels in said first word of said control image and said 
corresponding expanded difference components to form an expanded product word comprising expanded 
product components, with each expanded product component including the same number of leading zeros as 
said expanded pixel and having said control pixel number of least significant bits truncated to effect division 
by 2 raised to the power of said control pixel number; 

to perform a partitioned sum of said expanded product word and said first expanded word first source image 
to form a first expanded halfword of said destination image comprising expanded destination components; and 
to pack said destination components of said expanded halfword to form a subset of the pixels of said destination 
word. 

10. An apparatus as claimed in claim 7, characterised in that any one of said images is an unaligned image stored in
a memory having boundaries unaligned with the addresses of said memory and where said images comprise 
words including multiple pixels, and with each pixel in said control image comprising a control pixel number of bits, 
the apparatus being arranged if one of said images is an unaligned image; 

to generate an aligned address immediately preceding an unaligned address of a first word in said unaligned 
image; 

to calculate an offset being the difference between said aligned address and said unaligned address; 
to utilize said unaligned address and said offset to load a word from said unaligned image. 

11. An apparatus as claimed in claims 9 or 12, characterised in being arranged to define first and second constants
equal to 0x80008000 and 0x00ff00ff respectively;

to perform a partitioned subtraction operation to subtract corresponding components of said first constant from 
said components in said first word of said control image in parallel to form a first difference word; 
to return an aligned data word comprising components of said second constant and components of said first 
difference word; 

to perform a logical AND operation of corresponding components of said aligned data word and said second 
constant to generate a logical result word; 

to perform a partitioned packing operation of said logical result word to form a packed logical result word; 
to perform a partitioned multiplication of said first word of said control image and said first word of said first 
source image to form a first resulting product word; 

to perform a partitioned multiplication of said packed logical result word said first word of said first source 
image to form a second resulting product word; 

to perform a partitioned add operation on said first and second resulting product words to form a first sum word; 
to perform a partitioned multiplication of said first word of said control image and said first word of said second 
source image to form a third resulting product word; 

to perform a partitioned multiplication of said packed logical result word said first word of said second source 
image to form a fourth resulting product word; 

to perform a partitioned add operation on said third and fourth resulting product words to form a second sum 
word; 

to perform a partitioned add operation of said first and second sum words to form said first destination word; 

to compute an edge mask to store said destination word to said unaligned destination image; and 

to utilize said edge mask to perform a partial store of said destination word to said unaligned destination image. 